Scalable Nonparametric Bayesian Multilevel Clustering

نویسندگان

  • Viet Huynh
  • Dinh Q. Phung
  • Svetha Venkatesh
  • XuanLong Nguyen
  • Matthew D. Hoffman
  • Hung Hai Bui
چکیده

Multilevel clustering problems where the content and contextual information are jointly clustered are ubiquitous in modern datasets. Existing works on this problem are limited to small datasets due to the use of the Gibbs sampler. We address the problem of scaling up multilevel clustering under a Bayesian nonparametric setting, extending the MC2 model proposed in (Nguyen et al., 2014). We ground our approach in structured mean-field and stochastic variational inference (SVI) and develop a treestructured SVI algorithm that exploits the interplay between content and context modeling. Our new algorithm avoids the need to repeatedly go through the corpus as in Gibbs sampler. More crucially, our method is immediately amendable to parallelization, facilitating a scalable distributed implementation on the Apache Spark platform. We conduct extensive experiments in a variety of domains including text, images, and real-world user application activities. Direct comparison with the Gibbs-sampler demonstrates that our method is an order-ofmagnitude faster without loss of model quality. Our Spark-based implementation gains another order-of-magnitude speedup and can scale to large real-world datasets containing millions of documents and groups.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Bayesian Nonparametric Multilevel Clustering with Group-Level Contexts

We present a Bayesian nonparametric framework for multilevel clustering which utilizes grouplevel context information to simultaneously discover low-dimensional structures of the group contents and partitions groups into clusters. Using the Dirichlet process as the building block, our model constructs a product base-measure with a nested structure to accommodate content and context observations...

متن کامل

Small-Variance Asymptotics for Bayesian Nonparametric Models with Constraints

The users often have additional knowledge when Bayesian nonparametric models (BNP) are employed, e.g. for clustering there may be prior knowledge that some of the data instances should be in the same cluster (must-link constraint) or in different clusters (cannot-link constraint), and similarly for topic modeling some words should be grouped together or separately because of an underlying seman...

متن کامل

Small-Variance Asymptotics for Hidden Markov Models

Small-variance asymptotics provide an emerging technique for obtaining scalable combinatorial algorithms from rich probabilistic models. We present a smallvariance asymptotic analysis of the Hidden Markov Model and its infinite-state Bayesian nonparametric extension. Starting with the standard HMM, we first derive a “hard” inference algorithm analogous to k-means that arises when particular var...

متن کامل

Bayesian Hierarchical Clustering with Exponential Family: Small-Variance Asymptotics and Reducibility

Bayesian hierarchical clustering (BHC) is an agglomerative clustering method, where a probabilistic model is defined and its marginal likelihoods are evaluated to decide which clusters to merge. While BHC provides a few advantages over traditional distance-based agglomerative clustering algorithms, successive evaluation of marginal likelihoods and careful hyperparameter tuning are cumbersome an...

متن کامل

Supplementary Material for Bayesian Nonparametric Multilevel Clustering with Contexts

Vu Nguyen†, Dinh Phung†, XuanLong Nguyen‡, S. Venkatesh†, and Hung Bui∗ †Centre for Pattern Recognition and Data Analytics (PRaDA), Deakin University, Australia. {tvnguye,dinh.phung,svetha.venkatesh}@deakin.edu.au ‡Department of Statistics, Dept of Electrical Engineering and Computer Science University of Michigan. [email protected] ∗Laboratory for Natural Language Understanding, Nuance Commun...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2016